
Fix flaky test execution caused by Thread #34966

Merged: 1 commit into main from fix_flaky_2, Nov 27, 2024
Conversation

ydshieh
Collaborator

@ydshieh ydshieh commented Nov 27, 2024

What does this PR do?

As discussed offline:

Would it be possible to change the Thread below to a Process 🙏?
I know it's not good to change code that is working and makes sense for the sole purpose of testing, but the Thread here (started without a join) makes some test(s) fail flakily, e.g.
test_cached_model_has_minimum_calls_to_head
where we count the number of calls made to the Hub. Other tests that (eventually) call from_pretrained and hit the Thread line will affect test_cached_model_has_minimum_calls_to_head because they run at the same time in the same process.
There are other solutions, but they involve patching (a lot of) tests, or patching at the conftest level, which is not good either IMO.
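The interaction can be sketched with plain threading. Everything below is an illustrative stand-in, not the actual transformers internals:

```python
import threading
import time

# Hypothetical stand-in for the Hub request log that
# test_cached_model_has_minimum_calls_to_head inspects.
hub_calls = []

def background_head_request():
    # Simulates the metadata call that from_pretrained fires in a Thread.
    time.sleep(0.05)
    hub_calls.append("HEAD /model")

def from_pretrained_stub():
    # Fire-and-forget: the caller never joins, so the request can land
    # at any later point while other tests in the same process run.
    t = threading.Thread(target=background_head_request)
    t.start()
    return t

# "Test A" loads a model and moves on without joining...
pending = from_pretrained_stub()

# ...so when "Test B" snapshots the call count, the result depends on timing.
snapshot = len(hub_calls)

pending.join()
# Once the stray thread finishes, the count "Test B" relied on has moved.
final = len(hub_calls)
```

Running several tests concurrently in one process multiplies these stray threads, which is why the call-count assertion fails only intermittently.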


To check that the PR works (make sure tensorflow is available), run:

python script.py

where script.py is:

```python
import os

for i in range(20):
    os.system('python -m pytest -v tests/models/auto/test_modeling_auto.py tests/models/auto/test_modeling_tf_auto.py -k "test_from_pretrained_identifier or test_cached_model_has_minimum_calls_to_head"')
    # Alternatively, run several times on the CI workflow (it's flaky and
    # not easy to reproduce), or add `time.sleep(0.5)` at the beginning
    # of `auto_convenanorsion`.
```

@ydshieh ydshieh requested a review from LysandreJik November 27, 2024 10:31
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@LysandreJik LysandreJik left a comment


Thank you!

@ydshieh ydshieh merged commit 5f8b24e into main Nov 27, 2024
25 checks passed
@ydshieh ydshieh deleted the fix_flaky_2 branch November 27, 2024 15:32
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
@loadams
Contributor

loadams commented Dec 5, 2024

Hi @ydshieh - FYI, this change appears to have broken some unit tests in DeepSpeed, specifically where we download a model. I'll look at ways to resolve this; I assume it is an issue with multiprocessing/threading.

@ydshieh
Collaborator Author

ydshieh commented Dec 5, 2024

@loadams Thank you for informing us. I hope there is a way that works in both cases 🙏. Let me know if there is anything we can help with here.

BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024
@tomaarsen
Member

tomaarsen commented Dec 12, 2024

@ydshieh I'm also experiencing issues here for Sentence Transformers/Cross Encoder models, see UKPLab/sentence-transformers#3129

In short: multiprocessing.Process does not work when the process is started outside of an `if __name__ == "__main__":` guard. I recognize that most programs should be using that line, but I'd rather not force it on my users.

If one of my users loads any model that only has a pytorch_model.bin, then it'll fail, e.g.:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("embaas/sentence-transformers-gte-base")
```

or

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
```

which internally call

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("embaas/sentence-transformers-gte-base")
```

or

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")
```

All of these get:

```
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html
```
  • Tom Aarsen

@ydshieh
Collaborator Author

ydshieh commented Dec 12, 2024

> Hi @ydshieh - FYI this change appears to have broken some unit tests in DeepSpeed, specifically where we download a model. I'll look at ways to resolve this, I assume it is an issue in multiprocessing/threading.

Hi @loadams FYI: it's reverted back to use Thread in #35236 (as there are other issues reported)

@loadams
Contributor

loadams commented Dec 12, 2024

> I recognize that most programs should be using that line, but I'd rather not force it on my users.

@ydshieh - thanks for the update!

loadams added a commit to microsoft/DeepSpeed that referenced this pull request Dec 16, 2024
…ues in tests (#6822)

Changes from huggingface/transformers#34966
caused the `nv-torch-latest-v100` tests to fail with the following
error:

```
  File "/tmp/azureml/cr/j/e4bfd57a509846d6bbc4914639ad248d/exe/wd/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3941, in from_pretrained
    raise EnvironmentError(
OSError: Can't load the model for 'hf-internal-testing/tiny-random-VisionEncoderDecoderModel-vit-gpt2'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'hf-internal-testing/tiny-random-VisionEncoderDecoderModel-vit-gpt2' is the correct path to a directory containing a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
```

Sample failure here:
https://github.com/microsoft/DeepSpeed/actions/runs/12169422174/job/33942348835?pr=6794#step:8:3506

This was resolved on the Transformers side here:
huggingface/transformers#35236